In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or the values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized. In hierarchical models, the loss function may include terms from several levels of the hierarchy.
In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as Laplace, was reintroduced in statistics by Abraham Wald in the middle of the 20th century. In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of Harald Cramér in the 1920s. In optimal control, the loss is the penalty for failing to achieve a desired value. In financial risk management, the function is mapped to a monetary loss.
Many common statistics, including t-tests, regression models, design of experiments, and much else, use least squares methods applied using linear regression theory, which is based on the quadratic loss function L(a) = a².
The quadratic loss function is also used in linear-quadratic optimal control problems. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a quadratic form in the deviations of the variables of interest from their desired values; this approach is tractable because it results in linear first-order conditions. In the context of stochastic control, the expected value of the quadratic form is used. Because it squares deviations, the quadratic loss weights large outliers far more heavily than typical observations, so alternatives such as the Huber, log-cosh, and SMAE losses are used when the data contain many large outliers.
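As an illustration, here is a minimal Python sketch of the quadratic loss alongside two of the robust alternatives mentioned above; the threshold delta = 1.0 in the Huber loss is an arbitrary choice for the example, not a value fixed by theory.

```python
import math

def quadratic_loss(a):
    """Quadratic loss: penalizes a deviation a by its square."""
    return a ** 2

def huber_loss(a, delta=1.0):
    """Huber loss: quadratic near zero, linear beyond |a| = delta,
    so large outliers are penalized far less than under the quadratic loss."""
    if abs(a) <= delta:
        return 0.5 * a ** 2
    return delta * (abs(a) - 0.5 * delta)

def log_cosh_loss(a):
    """Log-cosh loss: smooth everywhere, behaving like a**2 / 2 for
    small a and like |a| for large a."""
    return math.log(math.cosh(a))

# At a = 5 the quadratic loss is 25, while Huber (~4.5) and
# log-cosh (~4.3) grow only linearly in the deviation.
for a in (0.5, 5.0):
    print(a, quadratic_loss(a), huber_loss(a), log_cosh_loss(a))
```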
In statistics and decision theory, a frequently used loss function is the 0-1 loss function L(ŷ, y) = ⟦ŷ ≠ y⟧, using Iverson bracket notation, i.e. it evaluates to 1 when ŷ ≠ y, and 0 otherwise.
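A minimal Python sketch of this loss (the class labels are invented for the example):

```python
def zero_one_loss(y_hat, y):
    """0-1 loss: 1 if the prediction is wrong, 0 if it is correct
    (the Iverson bracket of y_hat != y)."""
    return int(y_hat != y)

print(zero_one_loss("spam", "spam"))  # 0: correct classification incurs no loss
print(zero_one_loss("spam", "ham"))   # 1: any misclassification costs 1
```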
In frequentist statistics, the expected loss of a decision rule δ is its risk, R(θ, δ) = E_θ[L(θ, δ(X))] = ∫_X L(θ, δ(x)) dP_θ(x). Here, θ is a fixed but possibly unknown state of nature, X is a vector of observations stochastically drawn from a population, E_θ is the expectation over all population values of X, dP_θ is a probability measure over the event space of X (parametrized by θ), and the integral is evaluated over the entire support of X.
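As a concrete, hedged sketch in Python, the risk of the sample mean under squared-error loss can be estimated by Monte Carlo simulation. The normal population, the true θ = 2.0, and the sample size below are illustrative assumptions; under squared-error loss, the risk of the sample mean is its mean squared error, σ²/n.

```python
import random

# Illustrative setup: theta is the true (unknown) population mean.
theta = 2.0      # assumed "state of nature"
sigma = 3.0      # assumed population standard deviation
n = 10           # sample size
trials = 100_000

# Risk R(theta, delta) = E_theta[L(theta, delta(X))] with squared-error loss,
# estimated by averaging the loss over many simulated samples X.
total = 0.0
for _ in range(trials):
    sample = [random.gauss(theta, sigma) for _ in range(n)]
    estimate = sum(sample) / n          # decision rule delta(X): the sample mean
    total += (estimate - theta) ** 2    # squared-error loss L(theta, delta(X))

print(total / trials)   # Monte Carlo risk, close to sigma**2 / n = 0.9
```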
In a Bayesian approach, the expected loss is computed by averaging over the posterior: the Bayes risk ∫_Θ R(θ, δ) dπ(θ) can be rewritten as ∫_X [ ∫_Θ L(θ, δ(x)) π*(θ | x) dθ ] m(x) dx, where m(x) = ∫_Θ f(x | θ) dπ(θ) is known as the predictive likelihood wherein θ has been "integrated out," π*(θ | x) is the posterior distribution, and the order of integration has been changed. One then should choose the action a* which minimizes this expected loss, which is referred to as the Bayes risk. In the latter equation, the integrand inside dx is known as the posterior risk, and minimizing it with respect to decision a also minimizes the overall Bayes risk. This optimal decision, a*, is known as the Bayes (decision) rule: it minimizes the average loss over all possible states of nature θ, over all possible (probability-weighted) data outcomes. One advantage of the Bayesian approach is that one need only choose the optimal action under the actual observed data to obtain a uniformly optimal one, whereas choosing the actual frequentist optimal decision rule as a function of all possible observations is a much more difficult problem. Of equal importance, though, the Bayes rule reflects consideration of loss outcomes under different states of nature, θ.
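A small discrete sketch in Python may help; the two states of nature, the posterior probabilities, and the loss table are invented for illustration. The Bayes rule simply selects the action with the smallest posterior risk:

```python
# Hypothetical two-state decision problem: states of nature and their
# posterior probabilities pi*(theta | x) after observing some data x.
posterior = {"theta1": 0.7, "theta2": 0.3}

# Loss L(theta, a) for each action a (values invented for illustration).
loss = {
    ("theta1", "a1"): 0.0, ("theta2", "a1"): 10.0,
    ("theta1", "a2"): 2.0, ("theta2", "a2"): 1.0,
}

# Posterior risk of each action: sum over theta of L(theta, a) * pi*(theta | x).
actions = ["a1", "a2"]
posterior_risk = {
    a: sum(loss[(t, a)] * p for t, p in posterior.items()) for a in actions
}
bayes_action = min(posterior_risk, key=posterior_risk.get)

print(posterior_risk)  # {'a1': 3.0, 'a2': 1.7}
print(bayes_action)    # 'a2' minimizes the posterior expected loss
```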
A common example involves estimating "location". Under typical statistical assumptions, the mean or average is the statistic for estimating location that minimizes the expected loss experienced under the squared-error loss function, while the median is the estimator that minimizes expected loss experienced under the absolute-difference loss function. Still different estimators would be optimal under other, less common circumstances.
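This is easy to verify numerically. In the following Python sketch (the data values are arbitrary), a grid search over candidate location estimates recovers the sample mean under squared loss and the sample median under absolute loss:

```python
data = [1.0, 2.0, 2.0, 3.0, 10.0]   # arbitrary sample with one large value

def sum_squared(c):
    """Total squared-error loss of location estimate c."""
    return sum((x - c) ** 2 for x in data)

def sum_absolute(c):
    """Total absolute-difference loss of location estimate c."""
    return sum(abs(x - c) for x in data)

# Search a fine grid of candidate location estimates.
candidates = [i / 100 for i in range(0, 1101)]
best_sq = min(candidates, key=sum_squared)
best_abs = min(candidates, key=sum_absolute)

print(best_sq)   # 3.6, the sample mean (18/5)
print(best_abs)  # 2.0, the sample median
```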
In economics, when an agent is risk-neutral, the objective function is simply expressed as the expected value of a monetary quantity, such as profit, income, or end-of-period wealth. For risk-averse or risk-loving agents, loss is measured as the negative of a utility function, and the objective function to be optimized is the expected value of utility.
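A hedged numerical illustration in Python (the logarithmic utility and the gamble's payoffs are invented for the example): a risk-neutral agent is indifferent between a gamble and a certain payment with the same expected value, while a risk-averse agent with concave utility prefers the certain payment:

```python
import math

# A hypothetical gamble: 50% chance of 50, 50% chance of 150
# (expected value 100), versus receiving 100 for certain.
outcomes = [(0.5, 50.0), (0.5, 150.0)]
certain = 100.0

expected_value = sum(p * x for p, x in outcomes)

def utility(x):
    """Concave (risk-averse) utility; log is one standard textbook choice."""
    return math.log(x)

expected_utility_gamble = sum(p * utility(x) for p, x in outcomes)
utility_certain = utility(certain)

print(expected_value)           # 100.0: a risk-neutral agent is indifferent
print(expected_utility_gamble)  # ~4.461
print(utility_certain)          # ~4.605 > 4.461: the risk-averse agent prefers the sure 100
```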
Other measures of cost are possible, for example mortality rate or morbidity in the field of public health or safety engineering.
For most optimization algorithms, it is desirable to have a loss function that is globally continuous and differentiable.
Two very commonly used loss functions are the squared loss, L(a) = a², and the absolute loss, L(a) = |a|. However, the absolute loss has the disadvantage that it is not differentiable at a = 0. The squared loss has the disadvantage that it tends to be dominated by outliers: when summing over a set of a's (as in Σᵢ L(aᵢ)), the final sum tends to be the result of a few particularly large a-values, rather than an expression of the average a-value.
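A short Python sketch (with arbitrarily chosen deviations) makes the domination effect concrete: a single large deviation supplies nearly the entire squared-loss sum, but a much smaller share of the absolute-loss sum:

```python
deviations = [0.5, -0.3, 0.8, -0.2, 20.0]  # one large outlier among small errors

squared = [a ** 2 for a in deviations]
absolute = [abs(a) for a in deviations]

# Share of the total loss contributed by the single outlier.
print(squared[-1] / sum(squared))    # ~0.997: the outlier dominates the squared sum
print(absolute[-1] / sum(absolute))  # ~0.917: a much smaller share under absolute loss
```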
The choice of a loss function is not arbitrary: it is very restrictive, and sometimes the loss function may be characterized by its desirable properties. Detailed information on the mathematical principles of loss function choice is given in Chapter 2 of the book (and the references there). Among the choice principles are, for example, the requirement of completeness of the class of symmetric statistics in the case of i.i.d. observations, the principle of complete information, and some others.
W. Edwards Deming and Nassim Nicholas Taleb argue that empirical reality, not nice mathematical properties, should be the sole basis for selecting loss functions, and real losses often are not mathematically nice: they are not differentiable, continuous, symmetric, etc. For example, a person who arrives before a plane gate closure can still make the plane, but a person who arrives after cannot, a discontinuity and asymmetry which makes arriving slightly late much more costly than arriving slightly early. In drug dosing, the cost of too little drug may be lack of efficacy, while the cost of too much may be tolerable toxicity, another example of asymmetry. Traffic, pipes, beams, ecologies, climates, etc. may tolerate increased load or stress with little noticeable change up to a point, then become backed up or break catastrophically. These situations, Deming and Taleb argue, are common in real-life problems, perhaps more common than classical smooth, continuous, symmetric, differentiable cases.
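The plane-gate example can be written down directly as a loss function. In this hedged Python sketch, the fixed penalty for missing the flight is an invented number; the point is only the shape of the loss, discontinuous at the gate closure and sharply asymmetric around it:

```python
def gate_loss(minutes_late):
    """Loss for arriving `minutes_late` relative to gate closure (illustrative).

    Arriving early costs only the time spent waiting; arriving even one
    minute after closure costs a missed flight: discontinuous and asymmetric.
    """
    if minutes_late <= 0:
        return -minutes_late * 0.1   # mild cost of waiting at the gate
    return 500.0                     # invented fixed penalty for missing the plane

print(gate_loss(-30))  # 3.0: half an hour early, small cost
print(gate_loss(-1))   # 0.1: just in time
print(gate_loss(1))    # 500.0: slightly late, catastrophic jump
```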